[Core] [Frontend] Make detokenization optional #3749
Conversation
Quick code snippet to test this:
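A minimal sketch of such a snippet, assuming the offline `LLM` entry point and the new `detokenize` flag on `SamplingParams`; the model name and prompts are placeholders:

```python
from vllm import LLM, SamplingParams

# Small model and toy prompts purely for illustration.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.0, max_tokens=16, detokenize=False)

outputs = llm.generate(["Hello, my name is", "The capital of France is"], params)
for request_output in outputs:
    completion = request_output.outputs[0]
    # With detokenize=False, text stays empty and only the token IDs are returned.
    print(repr(completion.text), completion.token_ids)
```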
This yields (skipping the tqdm bar) `RequestOutput`s whose `text` is empty and whose `token_ids` contain the generated IDs. With a larger list of prompts and larger generation lengths, the speedup from skipping detokenization becomes more noticeable.
+1 on this feature - in practice it can benefit a lot of use cases where detokenization is not necessary and only the token ids themselves are needed for downstream tasks. We also observe roughly a 10-15% speedup when skipping detokenization (see vllm/vllm/engine/llm_engine.py, lines 419 to 421 in 9c82a1b).
After digging some more, I don't see an easy way to support this through the OpenAI server, at least while maintaining close compatibility with their API. Instead, I propose adding token input and output to the (non-OpenAI) API server. I've added a proposed implementation to this PR. The API server now takes either a text prompt or token IDs as input.
Thank you! And yeah, the other PR was opened almost the same minute as mine, otherwise we could have coordinated. They are indeed somewhat orthogonal though.
@mgerstgrasser The non-OpenAI API server is actually deprecated already and no longer accepting changes. IMO if the OpenAI API does not support token ids only as output, then we should keep it as a feature on the engine level instead of the server level (at least for this PR). I believe the goal of the OpenAI API server is the ability to serve an OSS model as a drop-in replacement for OpenAI endpoints, so I think we should try to include as little vLLM custom logic there as possible. Plus, there's no restriction for users to build their own API servers with just the engine.
Yes, absolutely! I kind of did the API server changes for my own project primarily (so exactly as you suggest, easy to build your own API server if needed). I figured I'd be happy to share that, but also don't mind removing it from the PR!
This reverts commit 3dd6255.
@mgerstgrasser Hello! Thanks for making this PR - IMO overall the logic looks good to me, but do you mind adding some tests just as a sanity check? The other thing I'm not sure about is whether we should enforce
Yes, absolutely! Is there anywhere existing where these could fit in well? I don't see any end-to-end tests where I could easily add something.
Hm, it's already implicitly enforced, because
Perhaps a new file under
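A rough sketch of what such a test might look like, assuming a pytest-style end-to-end check; the file path, model, and prompts here are placeholders:

```python
# e.g. tests/samplers/test_no_detokenize.py (placeholder path)
import pytest

from vllm import LLM, SamplingParams

PROMPTS = ["Hello, my name is", "The capital of France is"]


@pytest.mark.parametrize("model", ["facebook/opt-125m"])
def test_detokenize_flag(model):
    llm = LLM(model=model)

    with_text = llm.generate(
        PROMPTS, SamplingParams(temperature=0.0, max_tokens=8, detokenize=True))
    without_text = llm.generate(
        PROMPTS, SamplingParams(temperature=0.0, max_tokens=8, detokenize=False))

    for a, b in zip(with_text, without_text):
        # Same sampled tokens either way; only the text field differs.
        assert a.outputs[0].token_ids == b.outputs[0].token_ids
        assert a.outputs[0].text != ""
        assert b.outputs[0].text == ""
```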
Oh yes, that's right - could you leave a note on this? (Referenced: lines 141 to 143 in b95047f.)
(Force-pushed from b98a52b to fedea30.)
Done, and done! I see a few checks failed now, but as far as I can tell that's due to internal errors?
Looks great to me, thanks @mgerstgrasser @ywang96
@mgerstgrasser looks like you need to run
Co-authored-by: Nick Hill <[email protected]>
@njhill I did - not sure why it failed earlier, but the simplification commit triggered a re-run and it seems to be good now.
@mgerstgrasser Thank you again for working on this PR!
Of course! Thank you both @ywang96 @njhill for being so quick and responsive with reviewing, and for all your work on this fantastic project in general!
Co-authored-by: Nick Hill <[email protected]>
FIX #3635
This PR makes detokenization optional during generation. The changes here are the minimal ones required to enable this: we add an optional `detokenize` argument to `SamplingParams`, which defaults to `True`. In `LLMEngine` we then skip detokenization if `detokenize` is set to `False` for the current request. In that case, `CompletionOutput.text` is returned empty, but `CompletionOutput.token_ids` contains the generated token IDs.

One thing I don't know is how (if at all) we could make this work with the OpenAI API server, at least in combination with the `openai` Python client library. If anyone has ideas on that, or of course any other feedback, I'd be happy to make changes!
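As an aside on the downstream workflow this enables: a caller that occasionally does need text can decode the returned IDs itself. A sketch using the Hugging Face tokenizer directly (not part of this PR; the model name is a placeholder):

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(max_tokens=16, detokenize=False))
token_ids = outputs[0].outputs[0].token_ids

# vLLM returns only the IDs; decode lazily (and only when needed) on the caller side.
print(tokenizer.decode(token_ids, skip_special_tokens=True))
```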